Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video.
Comment: Accepted by CVPR 2023. Selected as a Highlight (Top 2.5% of all submissions).
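As a rough illustration of the output-score use of captions described above (a minimal sketch, not the released Cap4Video code; the encoders producing the embeddings and the fusion weight alpha are assumed placeholders), the Query-Caption matching branch can be fused with the Query-Video matching branch as a weighted sum of cosine-similarity matrices:

```python
# Sketch only: fusing Query-Video and Query-Caption matching scores.
# Embeddings and the weight `alpha` are hypothetical placeholders.
import torch
import torch.nn.functional as F

def retrieval_scores(query_emb, video_emb, caption_emb, alpha=0.5):
    """Combine query-video and query-caption cosine similarities.

    query_emb:   (Q, D) text-query embeddings
    video_emb:   (V, D) video embeddings (possibly caption-enhanced)
    caption_emb: (V, D) embeddings of captions generated for each video
    """
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    sim_qv = q @ v.t()  # Query-Video matching branch
    sim_qc = q @ c.t()  # Query-Caption matching branch
    return alpha * sim_qv + (1 - alpha) * sim_qc

# Toy usage: rank 4 videos for 2 queries with 512-d embeddings.
scores = retrieval_scores(torch.randn(2, 512), torch.randn(4, 512), torch.randn(4, 512))
ranking = scores.argsort(dim=-1, descending=True)
```

In a setup like this, the fusion weight would typically be tuned on a validation split rather than fixed at 0.5.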
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Vision-language models (VLMs) pre-trained on large-scale image-text pairs
have demonstrated impressive transferability on various visual tasks.
Transferring knowledge from such powerful VLMs is a promising direction for
building effective video recognition models. However, current exploration in
this field is still limited. We believe that the greatest value of pre-trained
VLMs lies in building a bridge between visual and textual domains. In this
paper, we propose a novel framework called BIKE, which utilizes the cross-modal
bridge to explore bidirectional knowledge: i) We introduce the Video Attribute
Association mechanism, which leverages the Video-to-Text knowledge to generate
textual auxiliary attributes for complementing video recognition. ii) We also
present a Temporal Concept Spotting mechanism that uses the Text-to-Video
expertise to capture temporal saliency in a parameter-free manner, leading to
enhanced video representation. Extensive studies on six popular video datasets,
including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show
that our method achieves state-of-the-art performance in various recognition
scenarios, such as general, zero-shot, and few-shot video recognition. Our best
model achieves a state-of-the-art accuracy of 88.6% on the challenging
Kinetics-400 using the released CLIP model. The code is available at
https://github.com/whwu95/BIKE.
Comment: Accepted by CVPR 2023.
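The parameter-free temporal saliency idea mentioned above can be sketched as weighting frame features by their similarity to a textual concept embedding and pooling them with a softmax over time (a sketch under assumptions; the frame/text encoders and the temperature are not taken from the paper):

```python
# Sketch of text-guided, parameter-free temporal pooling (not the released BIKE code).
import torch
import torch.nn.functional as F

def temporal_saliency_pooling(frame_feats, text_feat, temperature=0.07):
    """Aggregate frame features by their similarity to a text concept.

    frame_feats: (T, D) per-frame visual embeddings
    text_feat:   (D,)   embedding of the class name / textual concept
    Returns a single (D,) video representation.
    """
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sim = f @ t                                        # (T,) frame-text similarity
    weights = torch.softmax(sim / temperature, dim=0)  # temporal saliency, no learned parameters
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)

# Toy usage: 8 frames with 512-d CLIP-like embeddings.
video_repr = temporal_saliency_pooling(torch.randn(8, 512), torch.randn(512))
```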
3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal
Estimating 3D interacting hand pose from a single RGB image is essential for
understanding human actions. Unlike most previous works that directly predict
the 3D poses of two interacting hands simultaneously, we propose to decompose
the challenging interacting hand pose estimation task and estimate the pose of
each hand separately. In this way, it is straightforward to take advantage of
the latest research progress in single-hand pose estimation.
However, hand pose estimation in interacting scenarios is very challenging, due
to (1) severe hand-hand occlusion and (2) ambiguity caused by the homogeneous
appearance of hands. To tackle these two challenges, we propose a novel Hand
De-occlusion and Removal (HDR) framework to perform hand de-occlusion and
distractor removal. We also propose the first large-scale synthetic amodal hand
dataset, termed Amodal InterHand Dataset (AIH), to facilitate model training
and promote the development of the related research. Experiments show that the
proposed method significantly outperforms previous state-of-the-art interacting
hand pose estimation approaches. Code and data are available at
https://github.com/MengHao666/HDR.
Comment: ECCV 2022.
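A hypothetical outline of the decomposition described above (placeholder module names, not the released HDR code): restore the occluded parts of each hand, erase the other hand as a distractor, then run any off-the-shelf single-hand estimator on each cleaned crop:

```python
# Structural sketch of a de-occlusion -> distractor-removal -> single-hand pipeline.
# All modules are placeholders passed in by the caller.
import torch
import torch.nn as nn

class HDRStylePipeline(nn.Module):
    def __init__(self, deoccluder: nn.Module, remover: nn.Module, single_hand_net: nn.Module):
        super().__init__()
        self.deoccluder = deoccluder            # inpaints occluded regions of the target hand
        self.remover = remover                  # erases the distracting other hand
        self.single_hand_net = single_hand_net  # any single-hand 3D pose estimator

    def forward(self, left_crop: torch.Tensor, right_crop: torch.Tensor):
        poses = []
        for crop in (left_crop, right_crop):
            restored = self.deoccluder(crop)    # hand de-occlusion
            cleaned = self.remover(restored)    # distractor removal
            poses.append(self.single_hand_net(cleaned))
        return poses  # [left-hand pose, right-hand pose]

# Toy instantiation with identity placeholders, just to show the call pattern.
pipe = HDRStylePipeline(nn.Identity(), nn.Identity(), nn.Identity())
out = pipe(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```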
Terahertz Sensor via Ultralow-Loss Dispersion-Flattened Polymer Optical Fiber: Design and Analysis
A novel cyclic olefin copolymer (COC)-based polymer optical fiber (POF) with a rectangular porous core is designed for terahertz (THz) sensing and analyzed with the finite element method. The numerical simulations show an ultrahigh relative sensitivity of 89.73% for the x-polarization mode at a frequency of 1.2 THz under optimum design conditions. Under the same conditions, the fiber also exhibits an ultralow confinement loss of 2.18 × 10⁻¹² cm⁻¹, a high birefringence of 1.91 × 10⁻³, a numerical aperture of 0.33, and an effective mode area of 1.65 × 10⁵ μm². Moreover, the dispersion variation remains within 0.7 ± 0.41 ps/THz/cm over the 1.0–1.4 THz frequency range. Compared with traditional sensors, the proposed sensor is expected to be valuable for THz sensing and communication applications.
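For context, the figures of merit quoted above are conventionally computed from FEM outputs roughly as follows (assumed textbook-style formulas, not necessarily the paper's exact expressions); with the reported effective mode area of 1.65 × 10⁵ μm² at 1.2 THz, the numerical-aperture formula below does give about 0.33:

```python
# Common figures of merit for porous-core THz fibers, from assumed FEM outputs.
import numpy as np

C = 3e8  # speed of light, m/s

def birefringence(n_eff_x, n_eff_y):
    # Difference between the effective indices of the two polarization modes.
    return abs(n_eff_x - n_eff_y)

def confinement_loss_per_cm(freq_hz, n_eff_imag):
    # Power attenuation from the imaginary part of the effective index (1/m -> 1/cm).
    return (4 * np.pi * freq_hz / C) * n_eff_imag / 100.0

def relative_sensitivity(n_core, n_eff_real, power_fraction_in_core):
    # Fraction of modal power overlapping the analyte, scaled by the index ratio.
    return (n_core / n_eff_real) * power_fraction_in_core

def numerical_aperture(freq_hz, a_eff_m2):
    # NA estimated from the effective mode area at the operating frequency.
    return 1.0 / np.sqrt(1.0 + np.pi * a_eff_m2 * freq_hz**2 / C**2)

# Toy usage with assumed solver outputs at 1.2 THz.
f = 1.2e12
print(birefringence(1.4521, 1.4502))
print(confinement_loss_per_cm(f, 1e-15))
print(relative_sensitivity(1.35, 1.30, 0.86))
print(numerical_aperture(f, 1.65e5 * 1e-12))  # 1.65e5 um^2 converted to m^2, ~0.33
```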